Statistically Debugging Massively-Parallel Applications
Abstract
Statistical debugging identifies program behaviors that are highly correlated with failures. Traditionally, this approach has been applied to desktop software, where it is effective at identifying the causes underlying several difficult classes of bugs, including memory corruption, non-deterministic bugs, and bugs with multiple temporally distant triggers. The domain of scientific computing offers a new target for this type of debugging. Scientific code is run at massive scales, offering massive quantities of statistical feedback data. Data collection can scale well because it requires no communication between compute nodes. Unfortunately, existing statistical debugging techniques impose run-time overhead that, although modest and acceptable in desktop software, is unsuitable for computationally intensive code. Additionally, the normal communication that occurs between nodes in parallel jobs violates a key assumption of statistical independence in existing statistical models. We report on our experience bringing statistical debugging to the domain of scientific computing. We present techniques that reduce the run-time overhead of the required instrumentation by up to 25% over prior work, and we describe challenges related to data collection. We also discuss case studies of real bugs in ParaDiS and BOUT++, as well as some manually seeded bugs. We demonstrate that the loss of statistical independence between runs is not a problem in practice.
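To make the idea of "behaviors highly correlated with failures" concrete, the following is a minimal sketch of one simple way to score instrumented predicates. It is an illustration only and is not taken from the paper: the `Run` record, the predicate names, and the `correlation_scores` function are all hypothetical. For each predicate, the score is the failure rate among runs where the predicate held, minus the failure rate among all runs where the predicate was observed at all.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Feedback from one run (hypothetical record, not the paper's format)."""
    failed: bool
    observed: set = field(default_factory=set)  # predicates that were checked
    held: set = field(default_factory=set)      # predicates that were true


def correlation_scores(runs):
    """Score each predicate by how much it raises the failure rate.

    score(p) = P(fail | p held) - P(fail | p observed)
    Predicates that never held are skipped.
    """
    predicates = set().union(*(r.observed for r in runs))
    scores = {}
    for p in predicates:
        observed_runs = [r for r in runs if p in r.observed]
        held_runs = [r for r in observed_runs if p in r.held]
        if not held_runs:
            continue
        fail_given_held = sum(r.failed for r in held_runs) / len(held_runs)
        fail_given_observed = sum(r.failed for r in observed_runs) / len(observed_runs)
        scores[p] = fail_given_held - fail_given_observed
    return scores
```

A predicate with a score near zero is no more common in failing runs than its surrounding code is, while a clearly positive score marks a behavior worth inspecting. Each run's record can be gathered locally, which matches the abstract's point that data collection needs no communication between compute nodes.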
Similar resources
Interactive Debugging and Performance Analysis of Massively Parallel Applications
In the field of high performance computing, massively parallel processing systems (MPPs) are becoming more and more important. A rising number of complex applications is parallelized for execution on these machines. Still, a significant portion of the time needed for parallelization is spent on debugging and performance tuning. A main reason for this fact is the absence of adequate tools sup...
Grid-based Workflow Management for Automatic Performance Analysis of Massively Parallel Applications
Many Grid infrastructures have begun to offer services to end-users during the past several years, with an increasing number of complex scientific applications and software tools that require seamless access to different Grid resources via Grid middleware during one workflow. End-users of the rather HPC-driven DEISA Grid infrastructure take not only advantage of Grid workflow management capabili...
The Illinois Concert System: Programming Support for Irregular Parallel Applications
Irregular applications are critical to supporting grand challenge applications on massively parallel machines and extending the utility of those machines beyond the scientific computing domain. The dominant parallel programming models, data parallel and explicit message passing, provide little support for programming irregular applications. We articulate a set of requirements for supporting irreg...
An Overview of OCore: A Massively Parallel Object-based Language
In this paper we propose a massively parallel object-based language, OCore, as a research vehicle for massively parallel computation models. In addition to the fundamentals of existing parallel object-oriented languages, OCore introduces the notion of community, a structured set of objects that makes the distributed processing of messages possible together with their efficient implementation. OC...
A Tool for On-line Visualization and Interactive Steering of Parallel HPC Applications
Tools for parallel systems today range from specification through debugging to performance analysis and more. Typically, they help the programmers of parallel algorithms from the early development stages to a certain level of program optimization. However, in HPC (High Performance Computing) today the end-user of massively parallel CFD (Computational Fluid Dynamics) programs has little or no suppo...